Abstract

Both pixel-based scale saliency (PSS) and basis projection methods focus on multiscale analysis of data content and structure. Their theoretical relations and practical combination have been discussed previously. However, no models have since been proposed for calculating scale saliency on basis-projected descriptors. This paper extends those ideas into mathematical models and implements them in wavelet-based scale saliency (WSS). While PSS uses pixel-value descriptors, WSS treats wavelet sub-bands as basis descriptors. The paper discusses different wavelet descriptors: discrete wavelet transform (DWT), discrete wavelet packet transform (DWPT), quaternion wavelet transform (QWT) and best-basis quaternion wavelet packet transform (QWPTBB). WSS saliency maps of the different descriptors are generated and compared against other saliency methods both quantitatively and qualitatively. Quantitative results (ROC curves, AUC values and NSS values) are collected from simulations on the Bruce and Kootstra image databases with human eye-tracking data as ground truth. Furthermore, qualitative visual results of the saliency maps are analyzed and compared against each other as well as against the eye-tracking data included in the databases.

Keywords: visual attention, visual saliency, scale saliency, discrete wavelet 
transform, quaternion wavelet transform, wavelet packet best basis 



1. Introduction 

A few decades ago, Neisser proposed a fundamental theory of the human visual attention system, including pre-attentive and attentive stages [1], in his psychology studies. However, his work was unknown to machine vision scientists until David Marr [2], a neuroscientist, proposed a neurology-based computational model for Neisser's theory. The computational model includes a feature extraction stage followed by a perceptual grouping stage. Though the model was practically limited and rarely implemented, it inspired and provided a framework for several later computational models. Among them, the Itti model [3] holds significant influence and provides a standard in the research field. Itti models feature extraction as a center-surround operator, a property of the visual cortex, while perceptual grouping and attentive region assessment are handled by proto-object generation [4] and a Winner-Take-All network [3]. After center-surround operations at multiple scale levels were proposed by Itti et al. for the construction of conspicuity and saliency maps, other theories such as Graph-based Visual Saliency [5] and Spectral Residual Saliency [6] were introduced to produce more meaningful saliency maps [5] as well as to reduce the computational complexity [6]. These saliency models assume that human vision systems may behave like random-walk processes [5] or follow statistical properties of natural images [6].

Without making such strong assumptions, Kadir [7] and Gilles [8] initiated information-based saliency maps with their work on pixel-based scale saliency (PSS). Other information-related saliency research rapidly gained pace with Neil Bruce's An Information Maximization (AIM) theory [9] and Dashan Gao's Discriminative Information Saliency (DIS) [10]. Furthermore, the information-based spatial-temporal framework (ENT) [11] [12] extends the models from still images to the dynamic video context and speeds them up.

Information-based saliency approaches are all motivated by the assumption that human attention is attracted to spatial locations with highly informative content. From signal coding, compression and self-information theory, an event carries more information when it is structural and rare. Though based on similar concepts, each method has its own information estimation approach on a different type of descriptor. Generally, these approaches can be characterized according to their choices of descriptors and calculation methods. For example, PSS [7] and ENT [12] utilize pixel-value descriptors, whereas AIM [9] and DIS [10] rely on alternative basis-projection descriptors, ICA bases and wavelet bases respectively. With regard to information measurement, ENT and PSS employ the popular Shannon entropy, estimated by histogram construction or a Parzen kernel. AIM estimates self-information through neural-network training on patches of natural images. The decision-theory-based DIS obtains its discriminative information by classifying descriptors into center or surround classes. Notably, PSS is so far the only approach that accumulates information from both descriptors and their structure. However, PSS employs pixel-value descriptors and isotropic circular sampling, which might hinder its performance in terms of accuracy, because it fails to extract the oriented features common in natural images, as well as speed, because of the curse of dimensionality in information estimation.

The limitations of PSS sparked discussion of alternative solutions by Kadir et al. [7] [13]. Deploying basis-projected descriptors in place of pixel-based descriptors not only boosts the practical performance of scale saliency but also provides a deeper theoretical understanding of scale saliency and of multi-scale structural information in data. Moreover, the extension would make scale saliency the first method capable of using both pixel-value descriptors (PSS) and basis-projection descriptors. Wavelet elements are preferred as the alternative basis in this paper; therefore, the proposed method is named Wavelet Scale Saliency (WSS). To explain the extension from pixel-based descriptors to basis-projection descriptors clearly, we organize this paper into the following sections. Section 2 gives an overview of scale saliency and its main idea. Section 3 explains the rationale behind using the time-frequency domain instead of the time domain only for visual saliency, while Sections 4 and 6 elaborate on the statistical distribution and correlation of time-frequency descriptors. As wavelets are chosen as the time-frequency basis, Section 5 gives background information about the four types of wavelet transforms considered in this study: the discrete wavelet transform (DWT), the discrete best-basis wavelet packet transform (DWPTBB), and two quaternion wavelet transforms, QWT and QWPTBB. Accordingly, there are four time-frequency descriptors, each representing the time-frequency domain slightly differently from the others. Moreover, each descriptor depends on the particular morphological shape of its own mother wavelet. Details about the properties of those descriptors are organized in Section 6. Along with the new descriptors, suitable mathematical models of feature-space and inter-scale saliency estimation are derived in Section 7. Moreover, the mathematical derivation unveils a strong relation between WSS and another state-of-the-art method, Bayesian Surprise Saliency (BSS) [14]. Besides theoretical evaluation, simulations on the Neil Bruce image database [15] and the Kootstra image database [16] are carried out in order to compare quantitatively the proposed WSSs with different basis-projection descriptors, the original PSS and the ITT model. Furthermore, qualitative analysis on particular images provides better detail about the responses of the proposed methods to different types of scenes. It is possible that the performance of saliency methods depends on image content. All results and discussion are detailed in Section 8. Finally, the conclusion in Section 9 summarizes our main contributions and future research directions.



2. Scale Saliency

To clarify what exactly scale saliency is, a few fundamental principles of the original scale saliency are reviewed. Scale saliency uses the maximum feature-space entropy, weighted by its inter-scale dependency across scales, as the saliency value; furthermore, it argues that information measurement might be the data-driven pivot for human visual attention. Its mathematical model is summarized as follows.

$$ Y_D(s_p, x) = H_D(s_p, x) \, W_D(s_p, x) \tag{1} $$

$$ H_D(s_p, x) = -\int_{d \in D} p(d, s_p, x) \log_2 p(d, s_p, x) \, \mathrm{d}d \tag{2} $$

$$ W_D(s_p, x) = s_p \int_{d \in D} \left| \frac{\partial p(d, s, x)}{\partial s} \right|_{s = s_p} \mathrm{d}d \tag{3} $$

$$ s_p = \left\{ s : \frac{\partial H_D(s, x)}{\partial s} = 0, \; \frac{\partial^2 H_D(s, x)}{\partial s^2} < 0 \right\} \tag{4} $$

Feature-space saliency $H_D$ in Equation (2) is measured by the Shannon entropy of the pixel-value descriptor $d$ at a specific scale, or sampling window size, $s$ for each image location $x$. Shannon entropy is chosen since it satisfies four of the five criteria of multi-scale entropy filtering [17]. The last criterion requires structural correlation in the information estimate, which Shannon entropy does not consider. However, the inter-scale saliency $W_D$ fulfils this requirement; it is estimated at every location by the total variation of the descriptors' probability distribution function (PDF) across scales. Then, the scale $s_p$ carrying the most significant information should be found; it is the maximum point of the concave scale-entropy curve in Equation (4). Finally, the overall saliency is stated mathematically as Equation (1), in accordance with the definition of scale saliency. Let us apply the concept of scale saliency to a general signal $R(x_0, s_i) = \{ I(x_0, s_i) + N(x_0, s_i) \mid i = 1 \ldots n \}$, where $I_{x_0, s_i}$ is the ideal noise-free signal and $R_{x_0, s_i}$ is the measured signal with noise $N_{x_0, s_i}$ at a specific location and scale $(x_0, s_i)$. Assuming no dependency between the noise and the ideal signal, the estimated entropy is $H_D(R_{x_0, s_i}) = H_D(I_{x_0, s_i}) + H_D(N_{x_0, s_i})$. Assuming further that the noise PDF is scale-invariant, Equation (3) implies that the inter-scale saliency measure depends purely on the variation of the useful signal, $\Delta_s H_D(I_{x_0, s_i})$, and is not affected by the variation of the noise, since $\Delta_s H_D(N_{x_0, s_i}) = 0$. This briefly explains the basic motivation behind the scale saliency work; further mathematical analysis and experimental results can be found in [7] [15].
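
To make the model concrete, the following Python sketch evaluates Equations (1)-(4) at a single pixel in discrete form. It is an illustration only, not the original PSS implementation: the 16-bin histogram estimator, the finite-difference approximation of the scale derivative, and the names `histogram_pdf`, `entropy` and `scale_saliency` are assumptions made for this example.

```python
import math


def histogram_pdf(values, bins=16, lo=0, hi=256):
    """Normalized histogram of descriptor values (a discrete PDF estimate)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    return [c / n for c in counts]


def entropy(pdf):
    """Shannon entropy in bits: discrete form of Equation (2)."""
    return -sum(p * math.log2(p) for p in pdf if p > 0)


def scale_saliency(image, cx, cy, scales, bins=16):
    """Return (s_p, Y_D) pairs at the entropy peaks of pixel (cx, cy)."""
    pdfs = []
    for s in scales:
        # Circular sampling window of radius s, clipped to the image.
        vals = [image[y][x]
                for y in range(max(0, cy - s), min(len(image), cy + s + 1))
                for x in range(max(0, cx - s), min(len(image[0]), cx + s + 1))
                if (x - cx) ** 2 + (y - cy) ** 2 <= s * s]
        pdfs.append(histogram_pdf(vals, bins))
    H = [entropy(p) for p in pdfs]
    results = []
    for i in range(1, len(scales) - 1):
        # Discrete form of Equation (4): a local maximum of entropy over scale.
        if H[i] > H[i - 1] and H[i] > H[i + 1]:
            # Finite-difference form of Equation (3).
            ds = scales[i] - scales[i - 1]
            change = sum(abs(a - b) for a, b in zip(pdfs[i], pdfs[i - 1]))
            W = scales[i] * change / ds
            results.append((scales[i], H[i] * W))  # Equation (1)
    return results
```

For a bright disk of radius 3 on a dark background, the entropy curve of the centre pixel peaks just beyond the disk boundary, and that is the scale the sketch reports.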

The original scale saliency [7] uses pixel-value descriptors, which are a simple, intuitive and straightforward interpretation of image data. Moreover, their combination with a circular sampling window provides isotropic information analysis, independent of any morphological shape inside the sampled regions. Nevertheless, the approach has drawbacks: it is susceptible to noise, requires high computational cost and causes significant bias in entropy estimation. So far, histogram construction and the approximated Parzen kernel are the two popular parameterized methods for constructing the PDF of pixel-value descriptors and estimating entropy. Entropy bias and speed in these methods depend greatly on manual tuning of the number of histogram bins or the Parzen kernel size; in addition, they restrict the extension of scale saliency to higher-dimensional data. Suau [18] overcomes these problems by bypassing the PDF construction stage and estimating PSS with a multivariate, data-adaptive information estimation technique [19]. In spite of its fast computation for multivariate data, the non-PDF approach hinders the inter-scale saliency computation, which depends directly on PDFs (Equation 3). This is solved by adapting Kadir's elegant set-theory-based solutions [13] for the inter-scale saliency $W_D$ computation to a kd-tree structure. However, the solution is not intuitively or mathematically related to the information-based framework. That motivates us to develop WSS, a more coherent information-based scale saliency with sub-band energy descriptors, as a solution to all of these shortcomings of PSS.

3. Time-Scale-Frequency

A well-known computational model of visual attention was first described in Koch and Ullman's publication [20]. After that, several other models were proposed; however, they are usually over-complex and not biologically plausible. These disadvantages might be due to the pixel representation used in many early visual attention algorithms. To overcome these problems systematically, Urban [21] investigated strong constraints that keep computational complexity within an acceptable range for possible real-time implementation. These constraints are drawn from evidence from psychological experiments, which shows that images can be analyzed in psycho-visual channels, at least under TV-viewing conditions [22]. In other words, visual data can be further decomposed into channels and sub-bands instead of being used in raw pixel format. Furthermore, the channels can be effectively characterized by the separated frequency bands and orientation ranges of wavelet analysis [23]. Let us assume that the visually active areas of the brain can deploy some 9/7 Cohen-Daubechies-Feauveau (CDF) wavelet transform operators; the result is a multi-scale pyramid composed of oriented contrast maps with limited frequency range and a low-resolution image. For each level of wavelet decomposition, there are four channels: (i) sub-band 0 is the approximated image after filtering with low-pass blurring kernels; (ii) sub-band 1 extracts horizontal frequencies, corresponding to vertical edges in images; (iii) sub-band 2 contains frequencies and features along the two diagonals of the image frame; (iv) sub-band 3 captures vertical frequencies, mapping to horizontal features in images. Natural scenes are full of horizontal, vertical and diagonal features; therefore, human visual perception seems to prefer those dominant features. Besides orientation constraints, visual acuity is another limit of visual perception. Normally, the human fovea can resolve and process details above its visual acuity limit (1.5-2 degrees of visual angle). This corresponds to a frequency range of 0.7-0.5 pixels per degree, or 0.33-0.25 cycles per degree. This range is closely matched by the last-level low-resolution version of the image in a usual wavelet decomposition. Each decomposition level is generated by moving kernels with a different window size over all image positions. The spatial frequency ranges of the other wavelet analysis levels, which vary with analysis depth, are shown in Table 1.

Depth    Frequency range (cycles per degree)
0        10.7-5.3
1        5.3-2.7
2        2.7-1.3
3        1.3-0.7
4        0.7-0.3
5        0.3-0.2

Table 1: Wavelet Levels vs Frequency Range
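
The ranges in Table 1 follow a dyadic pattern: each additional decomposition level halves the frequency band. A short sketch, assuming a top frequency of 10.7 cycles per degree as in Table 1 and a hypothetical helper named `dyadic_bands`, reproduces the tabulated values to within rounding.

```python
def dyadic_bands(f_max, depths):
    """Return (depth, high, low) frequency bounds, halving at each depth."""
    bands = []
    high = f_max
    for depth in range(depths):
        low = high / 2.0
        bands.append((depth, high, low))
        high = low
    return bands


# 10.7 cycles per degree is taken from Table 1; six depths cover its rows.
for depth, high, low in dyadic_bands(10.7, 6):
    print(f"depth {depth}: {high:.1f}-{low:.1f} cycles per degree")
```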

Spectral energy is usually employed as a spectral signature for image collections or individual images. Urban et al. [21] analyse different sets of images belonging to four semantic categories: coast, mountain, street and open country. Interestingly, the Fourier spectrum of each category possesses a distinctive shape and frequency range, significantly different from the others [21]. In other words, the general spectral profile and the associated distribution histogram of each image class have a unique energy distribution. This distribution is proportional to the distance $d$ from the mean magnitude spectrum, normalized by the standard deviation of the category, where $AS(n, m)$ is the average spectrum. Carefully observing the spectral profiles gives distinguishing clues for each semantic scene. For example, "Coastal" scenes are dominated by horizontal features; therefore, their spectral profiles stretch along the vertical axis. Furthermore, the spectral profile of the "OpenCountry" category is biased toward the upper and lower parts of the spectrum. Though similar to the "OpenCountry" profile, the spectrum of "Street" images includes more types of features from the artificial environment besides horizon-oriented details. Therefore, the diamond-shaped spectral profile of "Street" images becomes more significant horizontally. The "Mountain" category, with its random scenic details, has an isotropic spectrum, while scenes of streets filled with artificial objects have a spectrum stretched along both the horizontal and vertical axes. From Urban's research, spectral energy distribution seems to be an important clue for the human visual perception system.

Besides image classification, the spectral distribution signature is also useful in visual attention and early visual processing. Such energy distributions become clues that differentiate features across scales. The spectral profile of an image feature at a particular scale helps differentiate it from the scales directly above and below. Let us do a thought experiment with a single square input signal $x(t)$ that is non-zero only between $t_1$ and $t_2$.

If $x(t)$ is filtered by a kernel $F()$ with kernel size (1-D kernel width) $W = \Delta T$, and $W$ is much smaller than the non-zero period of the given square signal, $\Delta T \ll |t_2 - t_1|$, the response will be just two impulse functions at $t_1$ and $t_2$.

For a 2-D signal or image, the above operation corresponds to classic edge detection. Though edges and structures play important roles in visual perception, their information does not sufficiently represent whole natural scenes. Natural images are rich in other features such as texture and flat regions besides edges and corners. As mentioned before, image features can be interpreted in terms of energy distribution. For example, edges are places of high energy concentration, homogeneous flat regions do not contain much energy, and textures, a hybrid of edges and flat regions, contain a moderate amount of energy. If only one window size is used in the analysis, significant responses come only from features or objects that happen to fit that window size. Other useful features whose size does not match the filter cannot be extracted. Therefore, a multi-scale approach is necessary in order to identify suitable kernel sizes or to fuse features from different scales. When a mother wavelet is chosen as the filtering kernel, window size becomes equivalent to the frequency ranges in Table 1, and choosing adaptive frequency ranges is an important computational task for spatial feature extraction. Inspired by this fundamental question in computer vision, this paper tries to contribute some insight into how spectral density distribution can characterize features at each scale and how the processing frequency range can be chosen appropriately for multi-scale feature representation. From the multi-scale features and appropriate scale selection, we can develop computational saliency methods capable of highlighting salient features across scales by using spectral energy distribution.
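
The square-pulse thought experiment above can be checked numerically. The text does not specify the kernel, so the sketch below assumes a zero-mean first-difference kernel; the positions `t1 = 20` and `t2 = 60` and the name `filter_response` are likewise arbitrary choices for illustration.

```python
def filter_response(signal, width):
    """Apply a zero-mean difference kernel of the given width (a crude edge filter)."""
    out = [0.0] * len(signal)
    for t in range(width, len(signal)):
        out[t] = signal[t] - signal[t - width]
    return out


t1, t2, n = 20, 60, 100
x = [1.0 if t1 <= t < t2 else 0.0 for t in range(n)]

# Kernel width 1 is much smaller than the pulse length t2 - t1 = 40.
response = filter_response(x, width=1)
edges = [t for t, r in enumerate(response) if r != 0.0]
print(edges)  # the only non-zero responses sit at the two pulse boundaries
```

With a kernel this narrow, the flat interior of the pulse produces no response at all, which is why a single small window captures edges but not larger structures.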



4. Time-Scale-Energy

PSS estimates information from pixel values, which are time-domain descriptors, by constructing the normalized histogram of pixel values as a probability distribution,

$$ p_h(d) = \frac{N_d}{N} $$

where $p_h$ is the probability of descriptor $d$, that is, the ratio between the number of pixels $N_d$ with descriptor $d$ and the total number of image pixels $N$. Let us use the square of the pixel value, $d^2$, as weights:

$$ p_e(d) = \frac{E_d}{E_x} = \frac{\int_{-\infty}^{\infty} x(t)^2 \, \delta(x(t) - d) \, \mathrm{d}t}{\int_{-\infty}^{\infty} x(t)^2 \, \mathrm{d}t} $$

The normalized weighted histogram $p_e(d)$ of a signal $x(t) \in L^2(\mathbb{R})$ can be interpreted as the energy density $\rho_x$ of descriptor $d$ in the time domain. By the isometric property of the Fourier transform, the PDF of the energy density distribution can be expressed in the frequency domain as well.

$$ p_e(d) = \frac{E_d}{E_x} = \frac{\int_{-\infty}^{\infty} \hat{x}(f)^2 \, \delta(\hat{x}(f) - d) \, \mathrm{d}f}{\int_{-\infty}^{\infty} \hat{x}(f)^2 \, \mathrm{d}f} $$
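
The difference between the count histogram $p_h(d)$ and the energy-weighted histogram $p_e(d)$ is easiest to see on a small discrete signal. The sketch below is a discrete analogue in which sums replace the integrals; the function names and the sample signal are illustrative assumptions.

```python
from collections import Counter


def count_histogram(x):
    """p_h(d): fraction of samples that take the value d."""
    n = len(x)
    return {d: c / n for d, c in Counter(x).items()}


def energy_histogram(x):
    """p_e(d): fraction of total signal energy carried by samples equal to d.

    Assumes the signal is not identically zero.
    """
    total = sum(v * v for v in x)
    energy = Counter()
    for v in x:
        energy[v] += v * v
    return {d: e / total for d, e in energy.items()}


x = [1, 1, 1, 1, 2, 2, 4]
p_h = count_histogram(x)   # the value 1 is the most frequent
p_e = energy_histogram(x)  # the value 4 carries the most energy
```

The rare but large value dominates $p_e$ even though it barely registers in $p_h$, which is the effect of weighting by $d^2$.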



Or, in the joint time and frequency domain,

$$ p_e(t, f) = \frac{\rho_x(t, f)}{E_x} = \frac{\rho_x(t, f)}{\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \rho_x(t, f) \, \mathrm{d}t \, \mathrm{d}f} $$

where

$$ \rho_x(t, f) = \left| \int_{-\infty}^{\infty} x(\tau) \, g^{*}_{t,f}(\tau) \, \mathrm{d}\tau \right|^2 $$

and $\rho_x$ is the energy density in the joint time-frequency representation. Pure time descriptors have perfect localization in time and no localization in frequency, and vice versa for frequency descriptors. With either extreme, the constructed PDF and the estimated information are difficult to interpret. Therefore, it is necessary to find a representation $g_{t,f}(\tau)$ that describes the spectral density of local energy. For example, the Short-Time Fourier Transform (STFT) is the first known transform capable of generating a spectrogram, a graphical representation of local signal energy in the time-frequency plane.



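$$ \rho_x(t, f) = \left| \int_{-\infty}^{\infty} x(\tau) \, g^{*}(\tau - t) \, e^{-j 2 \pi f \tau} \, \mathrm{d}\tau \right|^2 $$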



STFT identifies the spectral density as well as the local energy density, or information, in a short time period of the signal. However, it does not benefit scale saliency much unless scale parameters are explicitly considered, as in signal description on phase space. Fortunately, an alternative scale-based representation, the wavelet transform (WT), has been widely adopted in the signal processing community in recent years. Its fundamental idea is to replace the frequency shifting operation $e^{-j 2 \pi f \tau}$ by a time (or frequency) scaling operation $\psi\left(\frac{\tau - t}{a}\right)$, a basic wavelet kernel. Consequently, the energy density in the WT framework is formulated as follows.

$$ \rho_x(t, a) = \left| \int_{-\infty}^{\infty} x(\tau) \, \psi\left(\frac{\tau - t}{a}\right) \mathrm{d}\tau \right|^2 $$

The WT coefficients, through $\rho_x(t, a)$, measure the average spectral density of a frequency sub-band, that is, a short range of frequencies, over a short period of time. The characteristics of the time-frequency window are specified by two main parameters, the time shift $t$ and the scale $a$. Since this energy density distribution is derived from the short-time spectral representation, the spectrogram, by using scale operations, it is called a scalogram. Given the time-scale space, the total signal energy can be rewritten as follows.

$$ E_x = \iint \rho_x(t, a) \, \mathrm{d}t \, \mathrm{d}a $$

The probability of time-scale descriptors can then be specified with scale parameters:

$$ p_e(t, a) = \frac{\rho_x(t, a)}{E_x} \tag{5} $$

Generally, the wavelet transform generates frequency sub-band coefficients whose squares, divided by the total energy, give the density distribution of that sub-band in time-scale space.
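
As a discrete illustration of Equation (5), the sketch below squares the coefficients of a multi-level transform and normalizes them by the total energy. It assumes an orthonormal Haar wavelet purely for brevity (it is not one of the wavelets selected in this paper), a signal length divisible by $2^{levels}$, and the hypothetical names `haar_step` and `scalogram_probability`.

```python
import math


def haar_step(x):
    """One orthonormal Haar analysis step: returns (approximation, detail)."""
    r = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return approx, detail


def scalogram_probability(x, levels):
    """Squared Haar coefficients per scale, normalized by the total energy."""
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append([d * d for d in detail])
    bands.append([a * a for a in approx])  # coarsest approximation band
    total = sum(sum(b) for b in bands)
    return [[e / total for e in b] for b in bands]


x = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 0.0, 2.0]
p = scalogram_probability(x, levels=3)
```

Because the transform is orthonormal, the total coefficient energy equals the signal energy, so the normalized values form a valid probability distribution over time-scale cells.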



5. Wavelet Transform

Having discussed the usefulness of the time-frequency-scale representation in Section 3 and its corresponding time-frequency energy distribution, we recognize that wavelet representations are ideal candidates for our investigation into the energy density distribution and other statistical properties across multiple scales of natural images. During the relatively short history of wavelet analysis, the field has been very fruitful, and there are several analysis techniques with a wide range of characteristics. In this paper, only standard techniques, namely the discrete wavelet transform (DWT), the discrete wavelet packet transform (DWPT), the quaternion wavelet transform (QWT) and the quaternion wavelet packet transform with best basis (QWPTBB), are deployed as possible descriptors. In this section, we first look into the real discrete wavelet transform and the wavelet packet transform with best basis (DWT, DWPTBB) for their theoretical background; then QWT and QWPTBB, the quaternion versions of the two prior wavelet transforms, are considered.

5.1. Discrete Wavelet Transform

Though wavelets were first introduced in the early 20th century by Alfréd Haar, they developed rapidly only much later. Only recently have they been widely employed in many computer vision problems such as image or video de-noising, enhancement, coding and pattern classification [24] [25] [26]. Signal analysis for frequency components can be achieved by the Fourier transform (FT), but FT does not provide a suitable tool for time-frequency analysis of images. The Short-Time Fourier Transform (STFT) is an extension of the FT approach for analysing local frequency content over a short period of time [27]. Note that STFT can be applied over a spatial interval in a 2-D signal instead of a time period in a 1-D signal, since there is no time dimension for still images. However, STFT uses a fixed window kernel for every data block across the input signal; this property makes STFT less suitable for complex signal analysis, especially for signals with strong semantic structures appearing across multiple scales. In other words, STFT only succeeds with signals whose features are embedded in fixed, definite temporal or spatial regions, or when there is prior knowledge about a suitable window kernel size for STFT processing. Without these conditions, STFT may miss signal features entirely. Theoretically, the disadvantage can be avoided by employing STFT with multiple kernel sizes; however, this raises other issues, such as what range of sizes should be chosen to extract useful features optimally with reasonable computational effort.

The problems of STFT in analyzing local frequency have motivated the development of multi-scale wavelet techniques for better local frequency representation. Since the limitations of STFT are due to fixed-size processing windows, wavelet analysis deploys multi-resolution filter banks on the input signal. As illustrated in Figure 1, 1-D signals are decomposed into low-pass and high-pass components. In the case of 2-D input signals, the filtered outputs are four sub-bands: low-low, high-low, low-high and high-high, according to the processing orientation. Intuitively, 2-D signal analysis consists of row-wise 1-D analysis followed by column-wise 1-D analysis, or vice versa. With respect to processing direction, the high-low sub-band tends to extract horizontal features, the low-high sub-band prefers vertical features, the high-high sub-band detects diagonal features, and the low-low sub-band is an approximated version of the original signal at an inverse dyadic scale. Let us assume that the input signal is a two-dimensional grey-scale image $f(x, y)$, that the scaled mother wavelets have the mathematical form $\psi_{s,i}(x, y) = 2^s \psi_i(2^s x, 2^s y)$, $i \in \{v, h, d\}$, for the vertical, horizontal and diagonal sub-bands, and that the scaling function is $\phi_s(x, y) = 2^s \phi(2^s x, 2^s y)$ for low-resolution signals, with $s > S$. Then, we can represent any image $f(x, y) \in L^2(\mathbb{R}^2)$ as

$$ f(x, y) = \sum_{x, y} c_S(x, y) \, \phi_S(x, y) + \sum_{i \in \{v, h, d\}} \sum_{x, y, s > S} d_{s,i}(x, y) \, \psi_{s,i}(x, y) $$

where

$$ c_S(x, y) = \int f(x, y) \, \phi_S(x, y) \, \mathrm{d}x \, \mathrm{d}y $$

$$ d_{s,i}(x, y) \big|_{i \in \{v, h, d\}} = \int f(x, y) \, \psi_{s,i}(x, y) \, \mathrm{d}x \, \mathrm{d}y $$

$c_S(x, y)$ and $d_{s,i}(x, y)$ are the scaling coefficients and the wavelet coefficients from the vertical, horizontal and diagonal sub-bands. The parameter $S$ represents the lowest analysis depth, while $s$ denotes the higher decomposition levels in the multi-scale-space framework. As mentioned before, the 2-D DWT can be obtained by tensor products of 1-D DWTs when the original image $f(x, y)$ is analyzed along the two dimensions $x$ and $y$ separately. As a result, the scaling function $\phi(x, y)$ is approximated as $\phi(x)\phi(y)$, and the filter banks of the three directional sub-bands are $\phi(x)\psi(y)$, $\psi(x)\phi(y)$ and $\psi(x)\psi(y)$. In the corresponding figure, discrete wavelet transforms are carried out on the sample image on the left-hand side. The right-hand side contains the results of a two-level decomposition, with three distinctive sub-bands per level and a down-sampled version of the original image.
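
The separable construction described above (row-wise 1-D analysis followed by column-wise 1-D analysis) can be sketched in a few lines. The example assumes an orthonormal Haar filter pair for simplicity and an image with even width and height; the function names and the sub-band labelling (first letter for the row filter, second for the column filter) are conventions chosen for this sketch, and labelling differs between texts.

```python
import math


def haar_1d(v):
    """Orthonormal Haar split of a 1-D sequence into (low, high) halves."""
    r = math.sqrt(2.0)
    low = [(v[i] + v[i + 1]) / r for i in range(0, len(v), 2)]
    high = [(v[i] - v[i + 1]) / r for i in range(0, len(v), 2)]
    return low, high


def transpose(m):
    return [list(col) for col in zip(*m)]


def haar_2d(image):
    """One decomposition level: filter along rows first, then along columns."""
    row_low, row_high = zip(*(haar_1d(row) for row in image))

    def columns(block):
        low, high = zip(*(haar_1d(col) for col in transpose(block)))
        return transpose(low), transpose(high)

    ll, lh = columns(row_low)    # low-pass along rows, then low/high along columns
    hl, hh = columns(row_high)   # high-pass along rows, then low/high along columns
    return ll, lh, hl, hh


def energy(block):
    return sum(v * v for row in block for v in row)


# Vertical stripes: intensity changes only along each row.
image = [[0.0, 1.0, 0.0, 1.0] for _ in range(4)]
ll, lh, hl, hh = haar_2d(image)
```

For this striped image, all detail energy lands in the sub-band that is high-pass along the rows, and the four sub-band energies add up to the image energy because the transform is orthonormal.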

Note that a real wavelet transform such as the DWT suffers from shift variance: a small shift in the signal can greatly change the magnitudes of wavelet coefficients around singularities. Furthermore, it has no phase with which to encode signal location information, so aliasing effects are introduced into the recovery process. These issues need serious consideration whenever the real discrete wavelet transform is employed. Therefore, modelling the statistical properties of DWT coefficient magnitudes across scales may require extra investigation with those drawbacks in mind. Further arguments and details about this matter for DWT descriptors are discussed in a later section.
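
Shift variance is easy to reproduce on a toy signal. The sketch below, which assumes a single-level orthonormal Haar transform and a hypothetical helper `haar_details`, compares the detail energy of a step edge with that of the same edge shifted by one sample.

```python
import math


def haar_details(x):
    """Single-level orthonormal Haar detail (high-pass) coefficients."""
    return [(x[i] - x[i + 1]) / math.sqrt(2.0) for i in range(0, len(x), 2)]


step = [0.0] * 4 + [1.0] * 4     # edge aligned with the dyadic sampling grid
shifted = [0.0] * 5 + [1.0] * 3  # the same edge, shifted by one sample

energy_aligned = sum(d * d for d in haar_details(step))
energy_shifted = sum(d * d for d in haar_details(shifted))
print(energy_aligned, energy_shifted)
```

The aligned edge falls between two coefficient pairs and produces no detail energy, whereas the shifted edge falls inside a pair and does, even though the two signals differ only by a one-sample translation.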
Figure 1: Wavelet (solid line) & Wavelet Packet Decomposition (solid and dash lines) for
1-D signal

5.2. Discrete Wavelet Packet Transform

353 Well-known DWT can be computed efficiently by an orthonormal FIR 

354 conjugate quadrature filter banks g , gi including analysis low-pass and high- 

355 pass filters ( denoted (fis and if) S) i respectively. The low-pass coefficients 

356 cs{x,y) are decomposed recursively for a number of levels, and inverse dis- 

357 crete wavelet transform is calculated by an inverse filter bank. The exten- 

358 sion from normal wavelet transform (DWT) to wavelet packets transform 

359 (DWPT) is straightforward by an additional step at each processing level. 

360 Instead of decomposing only low-pass coefficients in the Low-Low sub-band 

361 for 2D input signals, the transform performs decomposition of high-pass co- 

362 efficients in Low-High, High-Low and High-High sub-bands as well. As a 

363 result, all coefficients of DWPT can be neatly arranged in a binary tree and 

364 addressed as follows. 

d Sii (x,y) ,se [0,5], ie [0,4 s - 1] 

365 where s is a analysis depth in the tree, S notes the deepest decomposed level, 

366 and i is the node index in this depth. With regards to other representations, 

367 wavelet packets have advantages in their adaptability to varying statisti- 

368 cal structure. Unlike Fourier Transform with one fixed-size base or normal 

369 wavelet transform with a fixed number of bases, we may search the "best" 

370 orthonormal bases from dictionary of basis acquired after wavelet package de- 

371 composition. This idea is initially proposed by Coifman et al^ZE\ mainly for 

372 signal compression. Therefore, this "best" basis is the best in terms of com- 

373 pressing ratio which often desires sparsest representation. In other words, 

374 input signals can be characterized by few large coefficients. Supposed the 

375 whole best basis operation is denoted as B 2 which exhaustively goes through 

376 the whole binary tree to look for locations of a set basis with parameters 

377 (s, i) such that the there is a minimum amount of uncertainty measured by 

378 Shannon entropy. More details can be found in Coifman's works [28], and 

379 B 2 can be summarized mathematically as follows. 

B 2 {al S)i ) : (s,i) = argmin Sji (y j H(d Sji )) 

Note that a brute-force search of every branch of the tree is sometimes infeasible because of its computational cost. Fortunately, fast algorithms exist for finding the best basis of a given signal. The optimal time-frequency representation is then obtained by tiling the time-frequency plane in accordance with the best-basis algorithm. Though this representation may be the sparsest in the time-frequency domain, whether sparseness of features matches the behaviour of human visual attention is still an open question. To address it, experiments have been carried out, and a performance comparison between the DWT case and the DWPTBB case is reported in Section 8.
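The best-basis search described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a 1-D signal, Haar filters and an additive Shannon cost normalized by the total signal energy, and the function names (`haar_split`, `entropy_cost`, `best_basis`) are hypothetical. A node is kept whenever its own cost does not exceed the summed cost of its two children, which is the bottom-up comparison used by fast best-basis algorithms.

```python
import math


def haar_split(x):
    """One Haar analysis step: return the (low-pass, high-pass) halves."""
    r = math.sqrt(2.0)
    low = [(x[2 * k] + x[2 * k + 1]) / r for k in range(len(x) // 2)]
    high = [(x[2 * k] - x[2 * k + 1]) / r for k in range(len(x) // 2)]
    return low, high


def entropy_cost(coeffs, total_energy):
    """Additive Shannon cost: -sum(p * log p) with p = c^2 / total energy."""
    cost = 0.0
    for c in coeffs:
        p = c * c / total_energy
        if p > 0.0:
            cost -= p * math.log(p)
    return cost


def best_basis(x, max_depth, total_energy=None, depth=0, index=0):
    """Return (cost, nodes); nodes is a list of (depth, index, coefficients)."""
    if total_energy is None:
        total_energy = sum(c * c for c in x) or 1.0
    own = entropy_cost(x, total_energy)
    if depth == max_depth or len(x) < 2:
        return own, [(depth, index, x)]
    low, high = haar_split(x)
    c_lo, b_lo = best_basis(low, max_depth, total_energy, depth + 1, 2 * index)
    c_hi, b_hi = best_basis(high, max_depth, total_energy, depth + 1, 2 * index + 1)
    if c_lo + c_hi < own:          # children describe the signal more sparsely
        return c_lo + c_hi, b_lo + b_hi
    return own, [(depth, index, x)]  # keep the parent node as a basis element


cost, basis = best_basis([1.0] * 8, max_depth=3)
```

For the constant test signal, all energy ends up in a single coefficient at the deepest low-pass node, so the selected basis has a cost close to zero. A 2-D version would split each node into four children instead of two.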

5.3. Quaternion Wavelet Transform

Like the wavelet packet transform discussed previously, the Quaternion Wavelet Transform (QWT) extends the real discrete wavelet transform (DWT), in this case to eliminate its shift-variance problem. Though there are several definitions and implementations of quaternion wavelets [29], the QWT implementation in this paper follows the research of Chan et al. [26]. Some background on the complex wavelet transform (CWT) and its implementation, the dual-tree complex wavelet transform (DT-CWT) of Kingsbury et al. [30], needs reviewing before the QWT can be explained and discussed. A 2-D DWT can be considered as a concatenation of two consecutive 1-D DWTs. Though the same process does not carry over exactly to the 2-D DT-CWT or QWT, a similar approach makes the QWT easier to explain: the 1-D DT-CWT is elaborated first; then we discuss the 2-D QWT and how it handles processing along different orientations and sub-bands.

The real DWT has two well-known drawbacks: it is not shift-invariant, and it has no phase with which to encode coefficient locations. Kingsbury et al. [31] recognize these problems and propose the dual-tree CWT as a specific solution. The rationale of the dual-tree approach is the use of complex numbers for wavelet coefficients, which directly tackles one of the DWT's two problems. The complex extension of the wavelet transform makes phase extraction from wavelet coefficients possible, since complex wavelet coefficients have both a real and an imaginary value, unlike real DWT coefficients with only one real value each. The real and imaginary components of the dual-tree CWT are generated by two sets of wavelet and scaling functions, $(\psi_h, \phi_h)$ and $(\psi_g, \phi_g)$. Moreover, the filter banks $(h_0, h_1)$ and $(g_0, g_1)$ have to be independent and orthogonal, as shown in Figure 2.

The notations $\phi_h(x)$ and $\psi_h(x)$ denote the scaling and wavelet functions corresponding to the filter bank $(h_0, h_1)$. In addition, $c_{h,s}$ and $d_{h,s}$ with $s < S$ denote the first set of DT-CWT coefficients. Similar notations are used for the second set of scaling and wavelet functions, $\phi_g(x)$ and $\psi_g(x)$, with the filter bank $(g_0, g_1)$ and the corresponding coefficients $c_{g,s}$ and $d_{g,s}$ with $s < S$. The wavelet functions $\psi_h(x)$ and $\psi_g(x)$ form the two binary trees in Figure 2, and the leaves of the two trees are the real and imaginary parts of a complex analytic wavelet.

$$\psi^C(x) = \psi_h(x) + j\psi_g(x)$$

Moreover, the imaginary wavelet $\psi_g(x)$ is the 1-D Hilbert transform of the real wavelet $\psi_h(x)$:

$$\psi_g(x) = \mathcal{H}\{\psi_h(x)\}$$

Every complex wavelet coefficient is formed from the wavelet coefficients of two real wavelet transforms; this combination therefore generates a 2x redundant tight frame. The redundancy of complex wavelet frames keeps coefficient magnitudes from oscillating around singularity points and makes the transform nearly shift-invariant. Furthermore, there is no energy (or little energy in practice) in the negative frequency region, because of the Hilbert-pair relationship between the two wavelet functions $\psi_h$ and $\psi_g$:

$$\Psi_g(\omega) = -j\,\mathrm{sgn}(\omega)\,\Psi_h(\omega)$$

then

$$\Psi^C(\omega) = \Psi_h(\omega) + j\Psi_g(\omega) = \left(1 + \mathrm{sgn}(\omega)\right)\Psi_h(\omega)$$

and

$$\Psi^C(\omega) = \begin{cases} 2\Psi_h(\omega), & \omega > 0 \\ 0, & \omega < 0 \end{cases}$$

Thus, the Fourier transform of the complex wavelet, $\Psi^C(\omega)$, has no energy in the negative frequency region. This makes the DT-CWT an analytic wavelet transform with analytic output signals. Due to this analyticity, the dual-tree wavelet transform implicitly carries all of its information in the positive half-plane of the frequency domain.

Extending the 1-D DWT to the 2-D DWT is quite straightforward, as discussed in the previous subsection 5.1. Unfortunately, the similar extension from the 1-D DT-CWT to the 2-D DT-CWT is not easy, because the Hilbert transform (HT) and analytic signals need a theoretical extension to 2-D signals. Furthermore, there is not one but several definitions, which define different zero-out regions (the negative frequency domain in the 1-D DT-CWT) and signal-power regions (the positive frequency domain in the 2-D DT-CWT). In this paper, we focus only on Bülow's definition [29] of analytic quaternion signals, which combines partial and total Hilbert transforms. Partial HTs are taken along the x or y direction only, while the total HT is taken along both directions simultaneously. They are defined by the following formulas.

$$f_{H_{i1}}(\mathbf{x}) = f(\mathbf{x}) \circ \frac{\delta(y)}{\pi x}, \qquad f_{H_{i2}}(\mathbf{x}) = f(\mathbf{x}) \circ \frac{\delta(x)}{\pi y}, \qquad f_{H_i}(\mathbf{x}) = f(\mathbf{x}) \circ \frac{1}{\pi^2 xy}$$

Here $f_{H_{i1}}(\mathbf{x})$ and $f_{H_{i2}}(\mathbf{x})$ are the partial Hilbert transforms along the x and y axes respectively, $f_{H_i}(\mathbf{x})$ is the total HT, and $\circ$ denotes 2-D convolution. Each 2-D CWT basis is a complex analytic function, computationally equivalent to a product of two 1-D complex wavelet functions along one or both axes. As in the extension of the real discrete wavelet, the diagonal sub-band wavelet is defined as $f(\mathbf{x}) = \psi_h(x)\psi_h(y)$. Its partial and total HTs are products of wavelet functions from the two sets deployed in the 1-D CWT implementation.

$$(f_{H_{i1}}, f_{H_{i2}}, f_{H_i}) = (\psi_g(x)\psi_h(y),\ \psi_h(x)\psi_g(y),\ \psi_g(x)\psi_g(y))$$

To unify the different Hilbert transforms in a meaningful and compact representation, we can use quaternion algebra and treat $f(\mathbf{x})$ as the real part and $(f_{H_{i1}}, f_{H_{i2}}, f_{H_i})$ as the three imaginary components [26].

$$f_A^q(\mathbf{x}) = f(\mathbf{x}) + j_1 f_{H_{i1}}(\mathbf{x}) + j_2 f_{H_{i2}}(\mathbf{x}) + j_3 f_{H_i}(\mathbf{x})$$

More details about the theory behind the QWT and its special characteristics, such as its singular cases, three phases, and zero-out regions, can be found in the publications of Chan et al. and Bülow [26, 29]. Based on the above quaternion form, we can organize the four quadrant components of a 2-D wavelet, $(f, f_{H_{i1}}, f_{H_{i2}}, f_{H_i})$, as a quaternion. Let us take the example of the diagonal wavelet, with the following quadrant components.

$$(f, f_{H_{i1}}, f_{H_{i2}}, f_{H_i}) = (\psi_h(x)\psi_h(y),\ \psi_g(x)\psi_h(y),\ \psi_h(x)\psi_g(y),\ \psi_g(x)\psi_g(y))$$

The quaternion wavelet function for the diagonal sub-band is then defined mathematically as follows.

$$\psi^D(x, y) = \psi_h(x)\psi_h(y) + j_1 \psi_g(x)\psi_h(y) + j_2 \psi_h(x)\psi_g(y) + j_3 \psi_g(x)\psi_g(y)$$

To compute the QWT coefficients, we can use the separable 2-D implementation [31] of the dual-tree filter banks illustrated in Figure 2. At each filtering stage, the two sets of wavelet filters h and g are applied independently to each dimension, x and y, of the 2-D image. For example, applying the filter bank h along both axes yields the scaling coefficients $c_{hh,s}$ and the diagonal, vertical and horizontal wavelet coefficients $d^D_{hh,s}$, $d^V_{hh,s}$ and $d^H_{hh,s}$ respectively, as shown in Figure 3.

Figure 3: Illustration of 2D dual-tree complex wavelet transform

The dual-tree implementation with two separate filter banks for 1-D signals becomes four independent filter banks for 2-D signals, one for each possible combination of the filters over the two dimensions (hh, hg, gh, gg). These filter combinations and the corresponding wavelet functions $\psi(x)\phi(y)$, $\phi(x)\psi(y)$ and $\psi(x)\psi(y)$ generate the four components of the quaternion wavelet transform for the horizontal, vertical, and diagonal sub-bands. The four wavelet coefficients from these filter banks are arranged by quaternion algebra to obtain the QWT coefficients. For example, a coefficient of the diagonal sub-band of the QWT can be written in terms of the responses of the independent filter banks as follows.

$$d^D = d^D_{hh} + j_1 d^D_{gh} + j_2 d^D_{hg} + j_3 d^D_{gg}$$



So far, we have taken the diagonal sub-band as an example of how the QWT is computed. The construction and properties of the other two sub-bands are similar, except that the axis combination is $\psi(x)\phi(y)$ for the horizontal sub-band or $\phi(x)\psi(y)$ for the vertical sub-band, instead of $\psi(x)\psi(y)$ for the diagonal sub-band. In summary, the QWT at each stage has three quaternion sets corresponding to the three sub-bands, and each quaternion contains four wavelet functions. There are therefore 12 functions in total, which can be arranged as a matrix as follows.

$$(d^H, d^V, d^D) \doteq \begin{pmatrix} d^H_{hh} & d^V_{hh} & d^D_{hh} \\ d^H_{gh} & d^V_{gh} & d^D_{gh} \\ d^H_{hg} & d^V_{hg} & d^D_{hg} \\ d^H_{gg} & d^V_{gg} & d^D_{gg} \end{pmatrix}$$

with the corresponding wavelet functions

$$\begin{pmatrix} \psi_h(x)\phi_h(y) & \phi_h(x)\psi_h(y) & \psi_h(x)\psi_h(y) \\ \psi_g(x)\phi_h(y) & \phi_g(x)\psi_h(y) & \psi_g(x)\psi_h(y) \\ \psi_h(x)\phi_g(y) & \phi_h(x)\psi_g(y) & \psi_h(x)\psi_g(y) \\ \psi_g(x)\phi_g(y) & \phi_g(x)\psi_g(y) & \psi_g(x)\psi_g(y) \end{pmatrix}$$

The columns of the above matrix correspond, from left to right, to the quaternion wavelet functions of the horizontal sub-band $d^H$, the vertical sub-band $d^V$, and the diagonal sub-band $d^D$. The three corresponding wavelet coefficients are $d^H$, $d^V$ and $d^D$, and the $\doteq$ operator denotes the formation of a quaternion from the coefficients along each column. Though quaternion wavelet coefficients carry rich phase information, our research currently focuses on the magnitude of each wavelet sub-band. The magnitudes of the horizontal, vertical and diagonal sub-bands are computed with the quaternion magnitude formula as follows.



$$\|d^H\| = \sqrt{(d^H_{hh})^2 + (d^H_{gh})^2 + (d^H_{hg})^2 + (d^H_{gg})^2}$$

$$\|d^V\| = \sqrt{(d^V_{hh})^2 + (d^V_{gh})^2 + (d^V_{hg})^2 + (d^V_{gg})^2}$$

$$\|d^D\| = \sqrt{(d^D_{hh})^2 + (d^D_{gh})^2 + (d^D_{hg})^2 + (d^D_{gg})^2}$$

The final approximation of the input signal, which is not decomposed further by the transform, likewise has its magnitude computed by quaternion algebra.

$$\|c\| = \sqrt{(c_{hh})^2 + (c_{gh})^2 + (c_{hg})^2 + (c_{gg})^2}$$

$$= \sqrt{(\phi_h(x)\phi_h(y))^2 + (\phi_g(x)\phi_h(y))^2 + (\phi_h(x)\phi_g(y))^2 + (\phi_g(x)\phi_g(y))^2}$$

5.4. Quaternion Wavelet Packet Transform

To construct a packet form of the QWT, every sub-band ($c_s$, $d^H$, $d^V$, $d^D$) should be repeatedly decomposed by the low-pass $(h_0, g_0)$ and high-pass $(h_1, g_1)$ filters. Bayram et al. [32] investigated the formation of wavelet packets for the DT-CWT, an equivalent form of the QWT. To obtain an analytic quaternion wavelet packet, the filter banks need to be chosen in a specific way such that the Hilbert transform relationship is preserved. In Bayram's work [32], the transform remains analytic if whatever filter bank is used to decompose the first filter bank of the QWT is also used for the second (dual) filter bank. Another important point about the extension to the quaternion wavelet packet transform (QWPT) is the choice of the extension filters $f_i(x)$. It has been found that the only constraint necessary to preserve the Hilbert transform property is the use of the same filter pair $f_0(x), f_1(x)$ in both filter banks of the QWT or DT-CWT. Therefore, any conjugate quadrature filter (CQF) pair with short support, good frequency selectivity or a number of vanishing moments is a candidate for the extension filters. Note that the same criteria, such as the CQF property, are used when extending a regular DWT. Like other derivatives of the DT-CWT or QWT, the QWPT is approximately shift-invariant, which means that the energy in each sub-band is approximately preserved when the input signal is shifted by a number of samples. Note also that there are other wavelet packet decompositions with a shift-invariance property besides the QWPT. For example, by performing an exhaustive search over all shifted wavelet packet bases to find the "best basis" according to a certain cost function [33], the orthonormal wavelet packet transform becomes shift-invariant in the sense that the energy in each sub-band is invariant to translation of the input signal. This (approximate) shift-invariance is very useful in the search for suitable energy descriptors: it makes the energy descriptors of the DT-CWT, DT-CWPT, QWT and QWPT robust to a certain amount of affine transformation of the input signal.

Filtering both the low and the high components at each stage of the DT-CWT produces a complete structure of all possible sub-bands that can be generated by the filter-bank pair. Each tree forms a unique frequency profile of the input signal. Among those many possibilities, there exists a frequency decomposition that is sparser and more compact than the others. It is called the best basis, in the sense that it represents the input signal with the fewest wavelet coefficients. A fast algorithm for finding such a best basis has been reported for the extension from DWT to DWPT [28]. In addition, as mentioned in Section 5.2, the same strategy can be adopted for the best-basis search in the QWT. In brief, the approach looks for a path in the binary decomposition tree that minimizes a Shannon entropy cost function; more details can be found in the work of Coifman et al. [28]. After the best-basis search for the QWPT decomposition, we can compute the magnitude and energy of the coefficients at a specific location by simple quaternion algebra.

$$\|q(a, b, c, d)\| = \sqrt{a^2 + b^2 + c^2 + d^2}$$
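The quaternion magnitude above translates directly into code. The following is a small sketch under the assumption that the four filter-bank responses (hh, gh, hg, gg) of one sub-band are available as equally sized 2-D lists; the function names are illustrative and not part of any QWT library.

```python
import math


def quaternion_magnitude(a, b, c, d):
    """Magnitude of the quaternion a + j1*b + j2*c + j3*d."""
    return math.sqrt(a * a + b * b + c * c + d * d)


def subband_magnitudes(d_hh, d_gh, d_hg, d_gg):
    """Element-wise quaternion magnitude of four equally sized 2-D arrays."""
    return [
        [quaternion_magnitude(*q) for q in zip(*rows)]
        for rows in zip(d_hh, d_gh, d_hg, d_gg)
    ]


# One row with two coefficient locations: (1, 2, 2, 4) and (0, 0, 3, 4).
magnitudes = subband_magnitudes([[1, 0]], [[2, 0]], [[2, 3]], [[4, 4]])
```

Squaring each magnitude gives the sub-band energy at that location.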



6. Wavelet Coefficients Correlation

An earlier section has discussed the potential of using the energy density distribution of localized time-scale elements, or wavelet elements, instead of the pixel-value probability distribution. Only general 1-D signals were considered, and these elements were assumed to be independent or at least linearly independent (uncorrelated); however, this assumption only holds when the input signals are random variables. In practice, any meaningful signal other than pure noise has specific structures that persist across multiple time-scale elements in the 1-D case. For 2-D signals like natural images, orientation needs to be considered as well; in other words, their wavelet coefficients are strongly statistically related across scales, orientations and space. This phenomenon is systematically studied and confirmed in the research of Azimifar et al. [31]. The author conducted an empirical study of joint wavelet statistics for textures and natural images to investigate the correlation between neighbouring coefficients. Examining these dependencies helps in proposing appropriate models for such a transform-domain algorithm. Though Azimifar's work [31] covers only linear dependencies and just glances at non-linear relations, its proposals are evaluated on a collection of 5000 real images. We therefore believe that the conclusions of that study hold generally, at least for natural images, its main object of study. In brief, there exist a few elementary correlation relationships, as follows.

• The spatially localized and sparse correlation structure persists clearly across scales.

• Every coefficient exhibits correlations extending across multiple scales, with spatially near neighbors both within and across orientations.

• Sub-band coefficients at the same spatial location but from different orientations are not linearly correlated.

• Within-subband, inter-subband, and inter-scale correlations are highly oriented and persist across the local neighbors of a coefficient's parent.

Figure 4 illustrates all the correlations mentioned above and, as expected, their preference for locality. This locality increases toward finer scales, which supports the persistence property of wavelet coefficients. A single coefficient correlates with its parents as well as with its neighbors across orientations and scales.

Figure 4: Illustration of wavelet coefficients inter-band and intra-band correlation [33]

Among the statistical dependencies mentioned, the most vital findings for our work are that sibling coefficients are uncorrelated across orientations and that coefficients are strongly correlated across scales, since this theoretically allows the uncertainty and mutual information of a 2-D time-scale element, or wavelet sub-band energy descriptor, to be estimated. To elaborate this point, let us consider two adjacent scales $s_1, s_2$ and their corresponding coefficients $w_i f(x, y, s_j)$ of horizontal, vertical and diagonal orientation, $i = v, h, d$, for a 2-D signal $f(x, y)$. Because sibling coefficients are uncorrelated across orientations, the three wavelet coefficients can be treated as a multivariate variable $W_s = (w_h, w_d, w_v)$, whose uncertainty $H(W_s)$ is estimated from the energy density distribution across the three orientations. For two adjacent scales $s_1, s_2$, there are two multivariate variables $W_{s_1}$ and $W_{s_2}$ with entropy values $H(W_{s_1})$ and $H(W_{s_2})$, and the mutual information between the two variables, due to inter-scale dependencies between corresponding wavelet coefficients, is computed as follows.

$$I(W_{s_1}, W_{s_2}) = H(W_{s_1}) + H(W_{s_2}) - H(W_{s_1}, W_{s_2}) \qquad (6)$$

The core idea of our proposal is developed from this basic observation about the inter-scale and intra-scale wavelet coefficients of natural images. More details about sub-band energy descriptors, and how to measure their uncertainty and mutual information, are given in the following sections.
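As a generic illustration of equation (6), the sketch below estimates mutual information between two discrete sequences from their empirical histograms, using the identity I(X, Y) = H(X) + H(Y) - H(X, Y). It assumes quantized descriptor values and base-2 logarithms; it is not the estimator used in the experiments, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical histogram (a Counter)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values() if n > 0)


def mutual_information(xs, ys):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) from paired discrete samples."""
    h_x = entropy(Counter(xs))
    h_y = entropy(Counter(ys))
    h_xy = entropy(Counter(zip(xs, ys)))
    return h_x + h_y - h_xy


coarse = [0, 0, 1, 1]
dependent = mutual_information(coarse, [0, 0, 1, 1])    # identical structure
independent = mutual_information(coarse, [0, 1, 0, 1])  # unrelated structure
```

Structure that persists from one scale to the next yields high mutual information, while unrelated sequences yield none.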



6.1. Interscale Subband Energy Descriptor 

The relationship between basis-projection methods and scale saliency has been discussed repeatedly in several publications [7], [13]. Kadir [7] discusses the behaviour of non-salient and salient regions in the spectral and wavelet domains. A simple, flat, non-salient region or image is sufficiently described by a single sub-band, whereas regions with complicated data and structure require more sub-band descriptors. This directly suggests basis-projected sub-bands as potential alternative descriptors. Like pixel-value descriptors, real wavelet sub-bands must be treated as discrete variables because of a theoretical restriction, the data-analysis uncertainty $\sigma_t \sigma_\omega \geq \frac{1}{2}$. In other words, a continuous distribution of wavelet sub-bands is impossible at any specific location. Following the available mathematical definition of PSS for discrete pixel descriptors, we sketch rough mathematical models of WSS with discrete sub-band descriptors, $\{e \in E, E = \{e_1, e_2, \dots, e_m\}\}$, in equations (7), (8), (9) and (10), where $e$ and $E$ are an element and the set of sub-band descriptors respectively.

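$$Y_D(s_p, \mathbf{x}) = H_D(s_p, \mathbf{x})\, W_D(s_p, \mathbf{x}) \qquad (7)$$

$$H_D(s, \mathbf{x}) = -\sum_{d \in D} p_{d,s,\mathbf{x}} \log_2 p_{d,s,\mathbf{x}} \qquad (8)$$

$$W_D(s, \mathbf{x}) = \frac{s^2}{2s - 1} \sum_{d \in D} \left| p_{d,s,\mathbf{x}} - p_{d,s-1,\mathbf{x}} \right| \qquad (9)$$

$$s_p = \{ s : H_D(s - 1, \mathbf{x}) < H_D(s, \mathbf{x}) > H_D(s + 1, \mathbf{x}) \} \qquad (10)$$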
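Equations (7) to (10) can be sketched in a few lines. The sketch assumes that the descriptor PDF at one location has already been estimated for each scale; `pdfs`, `scales` and the function names are illustrative choices, not names from the original method.

```python
import math


def entropy(p):
    """Feature-space entropy H_D of equation (8), in bits."""
    return -sum(q * math.log2(q) for q in p if q > 0.0)


def interscale_weight(p_s, p_prev, s):
    """Inter-scale weight W_D of equation (9) between scales s and s - 1."""
    return (s * s / (2.0 * s - 1.0)) * sum(abs(a - b) for a, b in zip(p_s, p_prev))


def scale_saliency(pdfs, scales):
    """Return {s_p: Y_D(s_p)} for every entropy peak, equations (7) and (10)."""
    h = [entropy(p) for p in pdfs]
    saliency = {}
    for k in range(1, len(scales) - 1):
        if h[k - 1] < h[k] > h[k + 1]:  # entropy peak over scale, equation (10)
            w = interscale_weight(pdfs[k], pdfs[k - 1], scales[k])
            saliency[scales[k]] = h[k] * w
    return saliency


# Two-bin descriptor PDFs at scales 2, 3 and 4; entropy peaks at scale 3.
result = scale_saliency([[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]], [2, 3, 4])
```

In this toy case the entropy at scale 3 is 1 bit and the weight is 9/5, so the saliency at the peak scale is 1.8.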

However, a general concept of sub-band descriptor is not useful in actual computation; an appropriate numerical attribute of sub-bands needs to be proposed instead. Let us consider a 2-D real discrete wavelet transform with three sub-bands, vertical (v), horizontal (h) and diagonal (d), at each dyadic scale $s_j$, represented by three sets of wavelet coefficients ($w_i$) in equation (11). Equation (12) uses those coefficients to compute sub-band energy densities as the descriptors ($e$) for Wavelet-domain Scale Saliency (WSS).

$$w_i f(x, y, s_j)\Big|_{i \in \{v,h,d\},\ s_j \in \{s_1, s_2, \dots, s_n\}} = f(x, y) * \psi_{s,i}(x, y, s_j) \qquad (11)$$

$$P\{w_i f(x, y, s_j)\}\Big|_{i \in \{v,h,d\},\ s_j \in \{s_1, s_2, \dots, s_n\}} = |w_i f(x, y, s_j)|^2 \qquad (12)$$

In the standard real discrete wavelet transform (DWT), three sub-bands are analysed at each dyadic sampling step. If the maximum level of wavelet decomposition is n, there are n dyadic scales with 3 sub-bands each. With 4 or 5 as the usual number of decomposition levels, around 12 or 15 sub-band descriptors in total are analysed for an image. This is significantly fewer than the 255 pixel-value descriptors of PSS for a grey-scale image.
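To make the descriptor count concrete, the following sketch computes the energy densities of equation (12) with a plain orthonormal Haar DWT. The Haar filter is only an assumption made to keep the example self-contained; the orientation labels and function names are illustrative.

```python
def haar_dwt2(img):
    """One level of the orthonormal 2-D Haar DWT of an even-sized image.

    Returns (ll, bands), where bands maps 'h', 'v', 'd' to detail sub-bands.
    """
    rows, cols = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * cols for _ in range(rows)]
    bands = {k: [[0.0] * cols for _ in range(rows)] for k in "hvd"}
    for r in range(rows):
        for c in range(cols):
            a, b = img[2 * r][2 * c], img[2 * r][2 * c + 1]
            e, f = img[2 * r + 1][2 * c], img[2 * r + 1][2 * c + 1]
            ll[r][c] = (a + b + e + f) / 2.0
            bands["h"][r][c] = (a + b - e - f) / 2.0  # top row minus bottom row
            bands["v"][r][c] = (a - b + e - f) / 2.0  # left column minus right column
            bands["d"][r][c] = (a - b - e + f) / 2.0  # diagonal difference
    return ll, bands


def energy_descriptors(img, levels):
    """Map (level, orientation) to the energy density |w_i f(x, y, s_j)|^2."""
    descriptors = {}
    approx = img
    for level in range(1, levels + 1):
        approx, bands = haar_dwt2(approx)
        for name, coeffs in bands.items():
            descriptors[(level, name)] = [[w * w for w in row] for row in coeffs]
    return descriptors


# A 4x4 image of vertical stripes, decomposed over 2 levels: 3 * 2 descriptors.
stripes = [[0, 1, 0, 1] for _ in range(4)]
descriptors = energy_descriptors(stripes, levels=2)
```

With n levels the dictionary holds 3n entries, matching the count given above; the striped image puts all of its detail energy into the level-1 'v' sub-band.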

Besides wavelet transforms, other basis projection techniques could also be used; for example, best-basis wavelet packet analysis (DWPTBB). The full wavelet packet transform breaks a signal into sub-bands of equal bandwidth at the maximum dyadic scale. That would not fit the scale saliency concept, which requires descriptors at different scales. Fortunately, the "balanced" full wavelet packet tree usually over-describes image properties, and the description can be optimized by the best-basis ($B^2$) search. The optimized wavelet packet tree usually has basis elements across dyadic scales, since some small image details are best described by a basis at a finer resolution while larger details prefer a basis at a coarser resolution. The DWPTBB coefficients are used to calculate the sub-band energy densities, the proposed image descriptors, in equations (13) and (14).

$$w_i f(x, y, s_j)\Big|_{(i,j) = B^2(w_i f(x, y, s_j))} = f(x, y) * \psi_{s,i}(x, y, s_j) \qquad (13)$$

$$P\{w_i f(x, y, s_j)\}\Big|_{(i,j) = B^2(w_i f(x, y, s_j))} = |w_i f(x, y, s_j)|^2 \qquad (14)$$



Comparing the mathematical statements (11), (12) and (13), (14) for the sub-band descriptors of the Discrete Wavelet Transform (DWT) and the Discrete Wavelet Packet Transform Best Basis (DWPTBB) respectively, we can see their fundamental differences. While the DWT is a deterministic basis-projection method with a pre-computed basis and a fixed structure of sub-bands, the DWPTBB adapts itself to each data set; its number and structure of sub-bands are specified by the best-basis ($B^2$) operator [28]. It requires more operations, but yields more faithful and adaptive descriptors.

Both DWPTBB and DWT are popular wavelet transforms; however, both depend on the shift-variant real discrete wavelet transform. This means that the projected coefficients depend not only on the data but also on its relative location in the scene.

$$w_i f(x, y, s_j) \neq w_i f(x + \Delta(x), y + \Delta(y), s_j), \quad \exists\, x, y, s_j, w_i, \Delta(x), \Delta(y) \qquad (15)$$

$$P\{w_i f(x, y, s_j)\} \neq P\{w_i f(x + \Delta(x), y + \Delta(y), s_j)\}, \quad \exists\, x, y, s_j, w_i, \Delta(x), \Delta(y) \qquad (16)$$

The fourth criterion for good information measurement of Starck et al. [17] states that entropy must work in the same way regardless of the descriptors' locations. Neither DWT nor DWPTBB descriptors satisfy that condition, since they might lead to different information estimates for identical data at two different locations. The shift-variance of the real wavelet transform can be avoided by a complex wavelet transform design; for instance, the recently developed dual-tree complex wavelet transform (DTCWT) [35], the quaternion wavelet transform (QWT) [26], or the dual-tree complex wavelet packet transform with best basis (DTCWTBB) [32]. The general formulas for the complex coefficients and their corresponding sub-band energy densities are summarized in equations (17) and (18).

$$w_i f(x, y, s_j)\Big|_{i \in \{v,h,d\} \vee B^2(w_i f)} = f(x, y) * \left( \psi_{g,s,i}(x, y, s_j) + j\psi_{h,s,i}(x, y, s_j) \right) \qquad (17)$$

$$P\{w_i f(x, y, s_j)\}\Big|_{s_j \in \{s_1, \dots, s_n\} \vee B^2(w_i f)} = \|w_i f(x, y, s_j)\|_2^2 \qquad (18)$$

Dual-tree approaches use two different wavelet filter banks, $\{\psi_g, \psi_h\}$, designed to form an analytic complex filter bank, $\{\psi_g(x, y) + j\psi_h(x, y),\ \psi_h(x, y) = \mathcal{H}(\psi_g(x, y))\}$. The magnitudes of the projected complex coefficients have been shown to be nearly shift-invariant; the derived energy density of the sub-bands is therefore shift-invariant as well. With this property, the quaternion wavelet transform (QWT) and the quaternion wavelet packet transform with best basis (QWPTBB) would probably provide better descriptors than their real counterparts according to the five criteria of Starck [17].

6.2. Intra-scale Subband Energy Descriptor

As mentioned in Section 6, there is strong correlation, or statistical linear dependence, between wavelet coefficients in natural images. The first kind, the inter-scale dependency, has been discussed in Section 6 and modelled as sub-band descriptors in Section 6.1. This relation has also been widely and effectively employed in tree-structured coding techniques such as SPIHT [36]. Besides the inter-scale relationship, many authors [25] have pointed out another strong correlation, among intra-band coefficients, which exists across many different types of natural scenes. Minh Do and M. Vetterli [25] successfully modelled the coefficients of a wavelet sub-band with a simple explicit mathematical form, the Generalized Gaussian Distribution (GGD). Because the statistical distribution of wavelet coefficients attracts a lot of interest, several researchers have proposed different mathematical models for it. However, few models of the marginal density of wavelet coefficients in a particular sub-band work better than the GGD in terms of accuracy and simplicity. The distribution is widely observed in experimental data from natural images, and the GGD is defined as follows.

$$p(x; \alpha, \beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)}\, e^{-(|x|/\alpha)^{\beta}}$$

where $\Gamma(z) = \int_0^{\infty} e^{-t} t^{z-1}\, dt,\ z > 0$, is the Gamma function. Here $\alpha$ is the scale parameter, which controls the width (variance) of the distribution, and $\beta$ controls its shape. For example, the GGD with $\beta = 2$ is the Gaussian distribution; it becomes the Laplacian distribution with $\beta = 1$.
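The GGD density is easy to evaluate with the standard library. The short sketch below (the function name `ggd_pdf` is an illustrative choice) also checks the two special cases numerically: with β = 2 and α = √2 the peak equals that of the standard Gaussian, and with β = 1 the peak equals 1/(2α), as for a Laplacian.

```python
import math


def ggd_pdf(x, alpha, beta):
    """Generalized Gaussian density with scale alpha and shape beta."""
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


gaussian_peak = ggd_pdf(0.0, alpha=math.sqrt(2.0), beta=2.0)   # 1 / sqrt(2 * pi)
laplacian_peak = ggd_pdf(0.0, alpha=1.5, beta=1.0)             # 1 / (2 * 1.5)
```

Fitting α and β to the coefficients of a real sub-band requires a separate estimation step, which is outside the scope of this sketch.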

7. Information Measurement



In the previous Section 6.1, four different wavelet transforms generate corresponding wavelet sub-band energy density descriptors. From those energy densities, the energy probability distribution function (PDF) at each scale $s_j$ can be computed as follows.

$$p_{interband}(x, y, s_j) = p\{P[w_i f(x, y, s_j)]\}\Big|_{\{i \in \{v,h,d\} \vee i = B^2(w_i)\},\ j \le m} = \frac{P[w_i f(x, y, s_j)]}{\sum_j \sum_i P[w_i f(x, y, s_j)]}$$

The above formula computes the probability of the energy density at one location $(x, y)$ across the sub-bands, $i \in \{v, h, d\}$ for DWT or QWT, or $i = B^2(w_i)$ for DWPTBB and QWPTBB, from the smallest scale, 1, to the currently considered scale, m. The first level uses the smallest sampling window of wavelet atoms and therefore yields coefficients with the finest details. The sampling window size is then doubled at each level, yielding coarser details. This is quite similar to the PSS sampling operation, except that scales are doubled rather than increased by one unit. Like the PDF of PSS descriptors, the PDF of WSS descriptors is built over increasing scales j, from level 1 (smallest wavelet atom) to level m (currently largest wavelet atom). From this PDF, it is straightforward to compute the feature-space entropy $H_{observer}(x, y, s_m)$ as in equation (19), where $p\{P[w_i f(x, y, s_j)]\}$ is abbreviated as $p_{interband}(x, y, s_j)$.

$$H_{observer}(x, y, s_m) = -\sum_{\{i \in \{v,h,d\} \vee i = B^2(w_i)\},\ \{j \le m\}} p_{interband}(x, y, s_j) \log p_{interband}(x, y, s_j) \qquad (19)$$
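The inter-band PDF and the "observer" entropy of equation (19) amount to a normalization followed by a Shannon sum. The sketch below assumes the energy densities of all sub-bands up to level m at one location have been collected into a flat list, and it uses base-2 logarithms, which the text does not specify; the function names are hypothetical.

```python
import math


def interband_pdf(energies):
    """Normalize sub-band energies at one location into a PDF."""
    total = float(sum(energies))
    if total == 0.0:
        return [0.0 for _ in energies]  # flat region: no energy in any sub-band
    return [e / total for e in energies]


def observer_entropy(energies):
    """Feature-space entropy of equation (19), in bits."""
    return -sum(p * math.log2(p) for p in interband_pdf(energies) if p > 0.0)


spread = observer_entropy([1.0, 1.0, 2.0])        # energy in several sub-bands
concentrated = observer_entropy([3.0, 0.0, 0.0])  # energy in a single sub-band
```

A location whose energy sits in a single sub-band has zero entropy, in line with the earlier remark that flat, non-salient regions are described by one sub-band.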

Both the entropy of PSS descriptors and the above entropy of the proposed descriptors summarize only the statistical properties of local spatial regions, since both the window sizes considered in PSS and the scale levels of the wavelet decomposition are finite. They therefore leave out the energy distribution over the whole image, which has been confirmed to be vital for natural image and texture modelling [25]. As presented in subsection 6.2, the coefficient magnitudes of a wavelet intra-band follow the Generalized Gaussian Distribution.

$$p_{intraband}(x; \alpha, \beta) = \frac{\beta}{2\alpha\Gamma(1/\beta)}\, e^{-(|x|/\alpha)^{\beta}} \qquad (20)$$

In order to combine global and local characteristics into a single value, we propose the cross-entropy $H_{searcher}(x, y, s_m)$ between the inter-band and intra-band distributions as an alternative formulation of equation (19).

$$H_{searcher}(x, y, s_m) = -\sum_{\{i \in \{v,h,d\} \vee i = B^2(w_i)\},\ \{j \le m\}} p_{interband}(x, y, s_j) \log p_{intraband}(x, y, s_j) \qquad (21)$$
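Equation (21) is a standard cross-entropy, sketched below. The sketch assumes that both distributions are given as lists aligned over the same sub-bands, that the intra-band values come from the GGD model evaluated for each sub-band, and that a small floor `eps` (an added safeguard, not part of the paper) avoids the logarithm of zero.

```python
import math


def searcher_entropy(p_interband, p_intraband, eps=1e-12):
    """Cross-entropy of equation (21), in bits."""
    return -sum(
        p * math.log2(max(q, eps))
        for p, q in zip(p_interband, p_intraband)
        if p > 0.0
    )


matched = searcher_entropy([0.5, 0.5], [0.5, 0.5])       # equals the entropy
mismatched = searcher_entropy([0.5, 0.5], [0.25, 0.75])  # larger than the entropy
```

When the two distributions agree, the cross-entropy reduces to the local entropy; any disagreement between local and global statistics increases it.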

To distinguish between the two modes of entropy computation, we name the local entropy of equation (19) the "observer" mode, and the cross-entropy involving both local and global statistics the "searcher" mode. In later formulas, when the general entropy symbol H appears without a specific subscript, both modes are eligible for that equation. These names also help to distinguish the parameters and simulation modes presented in the experimental Section 8.2.



Equation (19) computes the feature-space entropy of sub-band energy descriptors for WSS, as equation (8) does for PSS. Half of the scale saliency measure, the feature-space entropy, has thus been worked out for sub-band energy density descriptors. The other half of the problem lies in the computation of inter-scale saliency; in other words, in how equation (9) should be interpreted with the proposed descriptors. In equation (9), the inter-scale saliency is measured as the total variation between the probability distributions of the descriptors at two consecutive scales, and the same pixel-value descriptors (d) appear in both distributions. The situation is different for wavelet sub-band energy density descriptors, since each sub-band at the current level is unique to that level and does not appear at other levels of analysis. This property simplifies our task of building the sub-band probability distribution for different levels, but makes equation (9) inappropriate for sub-band features. Since it is unjustifiable to take the total variation of two PDFs over two different sets of descriptors, an alternative interpretation of inter-scale saliency needs to be developed. Let us consider $P(M) = \{p_{i,j}(x, y, s_j) \mid \forall i, j \le m\}$, the PDF of all sub-bands up to the current level m. When a new analysed sub-band, $D = \{p_{i,j}(x, y, s_j) \mid j = m + 1\}$, is generated, this sub-band descriptor modifies the current PDF into $P(M|D)$. The distance between the prior model and the modified model can be measured by the Kullback-Leibler divergence as follows.

$$K(P(M|D), P(M)) = \int P(M|D) \log \frac{P(M|D)}{P(M)} \qquad (22)$$
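For discrete distributions, the Kullback-Leibler divergence of equation (22) becomes a sum. The sketch below assumes the prior and the updated (posterior) model are given as aligned probability lists, and that the prior is non-zero wherever the posterior is; the function name is illustrative.

```python
import math


def kl_divergence(posterior, prior):
    """Discrete K(P(M|D), P(M)) in bits; prior must be > 0 where posterior > 0."""
    return sum(
        q * math.log2(q / p)
        for q, p in zip(posterior, prior)
        if q > 0.0
    )


no_surprise = kl_divergence([0.5, 0.5], [0.5, 0.5])  # model unchanged by new data
surprise = kl_divergence([1.0, 0.0], [0.5, 0.5])     # model sharpened by new data
```

New evidence that leaves the model unchanged carries no surprise, while evidence that moves all probability onto one outcome of a uniform two-outcome prior carries exactly one bit.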

Note that this is similar to Itti's Bayesian Surprise Saliency (BSS) metric [14], and the surprise model can be extended to multiple sub-band descriptors, or evidences in BSS terms. Equation (22) then becomes the mutual information between the current model and a set of new evidences. In other words, the expected surprise of adding new sub-bands to the current model is the mutual information between the new sub-bands and the current model, as shown in equation (23).

$$MI(D, M) = E\left[ K(P(M|D), P(M)) \right] = \int P(D, M) \log \frac{P(D, M)}{P(M)P(D)} \qquad (23)$$

Therefore, mutual information is chosen as the inter-scale saliency for successive dyadic scales, since it is in effect the averaged "Bayesian surprise" [14] saliency of sub-bands across scales. Furthermore, mutual information as an inter-scale saliency measure emphasizes the structural coherence of data across scales. If there are useful structures such as edges or joints, and they are consistent across consecutive scales, they increase the mutual information between the two scales. Noise, by contrast, has no mutual information across scales, as its self-information is zero, $I(N, N) = 0$. Notably, mutual information satisfies the fifth criterion for good information estimation of Starck et al. [17]. The only remaining step is to identify how the mutual information should be calculated in discrete cases. The following formulas show the relation between mutual information and entropy.

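$$MI(D, M) = H(D) + H(M) - H(D, M) \qquad (24)$$

$$H(M) = -\sum_{\{i=\{v,h,d\} \lor i=B^{2}(w_i)\},\,\{j<m\}} p_i(x, y, s_j) \log p_i(x, y, s_j) \qquad (25)$$

$$H(D) = -\sum_{\{i=\{v,h,d\} \lor i=B^{2}(w_i)\},\,\{j=m\}} p_i(x, y, s_j) \log p_i(x, y, s_j) \qquad (26)$$

$$H(D, M) = -\sum_{\{i=\{v,h,d\} \lor i=B^{2}(w_i)\},\,\{j<m+1\}} p_i(x, y, s_j) \log p_i(x, y, s_j) \qquad (27)$$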
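As a numerical sanity check on equations 22 to 24, the following Python sketch builds a small joint distribution P(D, M) and confirms that the expected Kullback-Leibler divergence, the direct mutual-information sum and the entropy difference all agree. The 2x3 joint table and the helper names (`entropy`, `kl_divergence`) are illustrative assumptions and are not taken from the proposed method or its implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution; 0 log 0 := 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence K(p, q), the form of equation 22."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy joint distribution P(D, M): rows index the evidence D, columns the model M.
# The numbers are arbitrary and only serve the illustration (assumption).
joint = [[0.10, 0.20, 0.10],
         [0.25, 0.05, 0.30]]
p_d = [sum(row) for row in joint]         # marginal P(D)
p_m = [sum(col) for col in zip(*joint)]   # marginal P(M)

# Equation 23, first form: surprise K(P(M|D), P(M)) averaged over D.
mi_surprise = sum(
    p_d[k] * kl_divergence([v / p_d[k] for v in joint[k]], p_m)
    for k in range(len(joint))
)

# Equation 23, second form: direct sum over the joint distribution.
mi_direct = sum(
    joint[k][l] * math.log(joint[k][l] / (p_d[k] * p_m[l]))
    for k in range(len(joint))
    for l in range(len(p_m))
    if joint[k][l] > 0
)

# Equation 24: difference between separate and joint entropies.
mi_entropy = entropy(p_d) + entropy(p_m) - entropy([v for row in joint for v in row])

print(mi_surprise, mi_direct, mi_entropy)
```

The three printed values coincide up to floating-point rounding, which is why the entropy form of equation 24 can stand in for the averaged surprise.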

The mutual information can be calculated directly as the difference between
the separate entropies, H(D) + H(M), and the joint entropy, H(D, M), of the
current energy descriptors (the current model) and the next-level sub-bands,
as in equation 24, while the entropy terms H(D), H(M) and H(D, M) are easily
estimated with equations 25, 26 and 27. Because the sub-band descriptors are
unique to each level, the joint entropy H(D, M) can be reused as H(M) for the
inter-scale saliency estimation at the next level. The scale saliency
principles for wavelet-domain sub-band energy descriptors are summarized in
equation 28 as the product of the maximum feature-space saliency and the
inter-scale saliency, that is, the product of the maximum sub-band entropy
and the mutual information between consecutive levels.

$$H(M(x, y, s_p)) = -\sum_{i=\{v,h,d\} \lor i=B^{2}(w_i),\, j<m} p_i(x, y, s_p) \log p_i(x, y, s_p)$$

$$MI(D(x, y, s_p), M(x, y, s_p - 1)) = H(D) + H(M) - H(D, M)$$

$$s_p \triangleq \{ s : H(M(s - 1, x, y)) < H(M(s, x, y)) \land H(M(s, x, y)) > H(M(s + 1, x, y)) \}$$

$$Y(M(x, y, s_p)) = H(M(x, y, s_p)) \times MI(D(x, y, s_p), M(x, y, s_p - 1)) \qquad (28)$$

The characteristic scale $s_p$ is chosen to maximize the information of the
model, H(M(s, x, y)). Consider the case in which an earlier scale contains
only noise while later scales contain the useful structures of the image.
Because Shannon entropy is biased toward noise, the characteristic scale then
fails to enclose any useful structure. To overcome this drawback, we propose
an alternative approach, named DIS to distinguish it from the original WSS
strategy, in which $s_p$ is selected so as to maximize the inter-scale
saliency, or average "Bayesian surprise". The DIS principle can be summarized
as follows.

$$s_p = \{ s : MI(D_{s-1}, M_{s-2}) < MI(D_s, M_{s-1}) \land MI(D_s, M_{s-1}) > MI(D_{s+1}, M_s) \}$$
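The two scale-selection rules can be sketched in Python as a search for interior local maxima along the scale axis, followed by the product of equation 28. This is a minimal sketch under stated assumptions: the per-scale entropies `h_model` and mutual-information values `mi` at a single location (x, y) are taken as precomputed lists indexed by scale, `mi[s]` is assumed to hold MI(D_s, M_{s-1}), and the function names are hypothetical rather than taken from the original MATLAB implementation.

```python
def peak_scales(values):
    """Indices s with values[s-1] < values[s] > values[s+1] (interior peaks)."""
    return [s for s in range(1, len(values) - 1)
            if values[s - 1] < values[s] and values[s] > values[s + 1]]

def scale_saliency(h_model, mi, strategy="WSS"):
    """Return {s_p: H(M(s_p)) * MI(s_p)} for each characteristic scale s_p.

    WSS selects peaks of the model entropy; DIS selects peaks of the
    inter-scale mutual information. Both inputs are per-scale lists for
    one location (x, y); mi[s] is assumed to hold MI(D_s, M_{s-1}).
    """
    if len(h_model) != len(mi):
        raise ValueError("h_model and mi must cover the same scales")
    if strategy == "WSS":
        peaks = peak_scales(h_model)
    elif strategy == "DIS":
        peaks = peak_scales(mi)
    else:
        raise ValueError("strategy must be 'WSS' or 'DIS'")
    return {s: h_model[s] * mi[s] for s in peaks}

# Toy values (assumed): entropy peaks at an early, noise-like scale, while
# mutual information peaks later, where structure persists across scales.
h_model = [1.0, 2.0, 1.5, 1.4, 1.2]
mi = [0.1, 0.2, 0.3, 0.9, 0.4]

wss = scale_saliency(h_model, mi, "WSS")  # characteristic scale 1
dis = scale_saliency(h_model, mi, "DIS")  # characteristic scale 3
print(wss, dis)
```

On these toy values WSS settles on the early entropy peak, whereas DIS moves the characteristic scale to where the inter-scale information is largest, which is the behaviour the DIS strategy is meant to provide.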

Experiments with DIS and WSS are carried out, and the simulation results are
detailed in the next section in order to assess the effectiveness of the
proposed strategy.



8. Discussion & Results

The previous sections 6 and 7 analysed the theoretical advantages of WSS and
its derivative DIS. In addition, subsection 6.1 presents four descriptors
based on different wavelet transforms: DWT, QWT, DWPTBB, and QWPTBB.
Accordingly, the proposed method has several derivatives, depending on the
choice of scale selection mechanism and sub-band descriptor. To evaluate them
against other saliency approaches, they are compared with PSS [7] and the
de facto standard ITT model [3]. The purpose of these comparisons is not to
claim the best saliency method or to race toward the highest possible
evaluation score; it is only to support the assumption that feature and
structural complexity are good clues for human attention. The best reported
evaluation score does not necessarily mean the best saliency maps, since it
depends heavily on the choice of databases, the elimination of experimental
bias, the performance of human test subjects, and so on. Moreover, a
standardized process for evaluating saliency maps is far from being reached,
since researchers choose different databases and measurement methods, or even
create their own. Since our research focuses on the effectiveness of
information measurement in visual attention, the most common processes and
databases are chosen to confirm that the assumption generalizes.

In the search for informative clues, the Bruce and Tsotsos [15] database is
certainly among the most widely used sets of stimuli. However, Bruce's
database alone is not enough, owing to the limited number and content of its
stimuli. Therefore, Kootstra's database [16] is chosen to provide extra test
samples and ground truths. Two databases with over 200 samples, and ground
truths provided by more than 50 human subjects, help to confirm the
generalization of our proposed framework to a certain extent. Similarly, only
common evaluation approaches are deployed in our study, and they can be
categorized as either quantitative or qualitative methods. Quantitative
relations between the different saliency methods and human visual performance
are shown by appropriate statistical measures (AUC, NSS) with eye-tracking
data as ground truth. Meanwhile, the qualitative results, visual comparisons
of different saliency maps, give a glimpse of the performance on each
individual sample. They also identify imaging contexts in which saliency
methods give reasonable solutions, as well as situations in which the
saliency maps are unreasonable to human perception.

8.1. Databases of image stimuli

The ground truth and data for basic evaluation of visual saliency performance
are obtained from eye-tracking experiments. Specifically, in Neil Bruce's
database, 120 different color images are observed in random order, with a gap
of 4 seconds between consecutive stimuli. To ensure the consistency and
accuracy of the database, subjects are seated 0.75 m in front of a 21 inch
CRT monitor. Notably, the subjects are given no further instructions for any
action, nor any clue about which image appears next. Furthermore, image
contents vary from indoor to outdoor environments. Some scenes contain
clearly interesting objects, while others are generic, with no subject of
particular interest. A non-head-mounted eye-tracking apparatus extracts the
locations of eye fixations while the test subjects look at the sample images.
Other set-up parameters are intended for general-scene stimuli of the kind
typically found in urban environments. Moreover, the same parameters are used
to collect data from 20 different subjects over the 120 test samples. Figure
5 shows the first eight images from Neil Bruce's database. Despite its
popularity, Neil Bruce's database has narrow semantic content, since it
contains only urban scenes and mainly indoor environments. Besides, the
number of samples is relatively small. For these reasons, an additional
database should be included in the simulations so that there are more test
images of natural objects, such as animals and flowers, in natural
environments.

Figure 5: Neil Bruce's database

Kootstra's database [16] satisfies these requirements and provides additional
ground truths for further experiments. Kootstra's ground-truth data are also
collected from eye-tracking experiments, although the experimental process
differs slightly from the one used to collect Bruce's database. In the
psychological experiment, which used a head-mounted eye-tracking device,
thirty-one students (15 men, 16 women) aged from 17 to 32 took part, all of
them naive about the aims of the experiment. Each subject observes a total of
99 photographic images in five different categories while their eye movements
are recorded with the head-mounted device. There are nineteen images in the
natural symmetry category, each of which contains symmetrical natural
objects. Besides such symmetrical scenes, other non-symmetrical photographic
scenes are included in the image set: 12 images of animals in natural
settings, 12 images of street scenes, 16 images of buildings and 40 images of
natural environments. Figure 6 gives an example from each of the 5 categories
of images in Kootstra's database. Note that each image is presented to
viewers at a resolution of 1024x768 pixels on an 18" CRT monitor at a
distance of 70 cm from the participant.

Figure 6: Kootstra's database

8.2. Quantitative Comparisons of Saliency Methods

The quantitative performance measures include Receiver Operating
Characteristic (ROC) curves with the Area Under the ROC Curve (AUC), and
Normalized Scanpath Saliency (NSS) as numerical results. To ensure fair
comparisons between methods, open-source evaluation code for AUC and NSS [37]
is employed. Note that saliency maps are standardized around the median
instead of the mean of their distributions. Quantitative evaluation of visual
saliency maps on natural images with eye-tracking data as ground truth was
initially studied by Tatler and recently summarized by Borji et al. [37].
More information about the mechanisms behind ROC and AUC can be found in
[37]. In this section, we focus only on using these quantitative methods to
compare and evaluate our approach and to support its rationale. As the main
purpose of this evaluation is to confirm the effectiveness of informative
clues in human visual attention, our approach is not tuned to reach the
maximum AUC or NSS.
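For readers unfamiliar with the two scores, the following Python sketch shows one common way to compute them; it is not the open-source evaluation code of [37]. NSS is taken as the mean standardized saliency at fixated pixels, and AUC as the probability that a fixated pixel outscores a non-fixated one (ties counted as one half), which equals the area under the ROC curve. The `center` option, which swaps the mean for the median while keeping the standard deviation as the spread, is only one possible reading of the median standardization mentioned above; the function names and the toy map are assumptions.

```python
import statistics

def nss(saliency, fixations, center="mean"):
    """Normalized Scanpath Saliency: mean standardized saliency at fixations.

    saliency  -- 2-D list of floats (rows of the saliency map)
    fixations -- list of (row, col) fixated pixels
    center    -- "mean" (usual definition) or "median" (assumed variant)
    """
    flat = [v for row in saliency for v in row]
    if center == "median":
        mid = statistics.median(flat)
    else:
        mid = sum(flat) / len(flat)
    spread = statistics.pstdev(flat)
    if spread == 0:
        raise ValueError("a constant saliency map cannot be standardized")
    scores = [(saliency[r][c] - mid) / spread for r, c in fixations]
    return sum(scores) / len(scores)

def auc(saliency, fixations):
    """Area under the ROC curve: fixated pixels versus all other pixels."""
    fixated = set(fixations)
    pos = [saliency[r][c] for r, c in fixated]
    neg = [v for r, row in enumerate(saliency)
           for c, v in enumerate(row) if (r, c) not in fixated]
    if not pos or not neg:
        raise ValueError("need at least one fixated and one non-fixated pixel")
    # Count pairs where the fixated pixel wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy 3x3 saliency map and two fixations on its brightest pixels (assumed data).
saliency = [[0.1, 0.2, 0.1],
            [0.2, 0.9, 0.3],
            [0.1, 0.4, 0.1]]
fixations = [(1, 1), (2, 1)]

print(nss(saliency, fixations), nss(saliency, fixations, center="median"))
print(auc(saliency, fixations))
```

A map that ranks every fixated pixel above every other pixel reaches an AUC of 1, while NSS additionally rewards how far above the typical saliency level the fixated pixels lie.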



All four descriptors mentioned in section 6.1 have been simulated with image
samples from both Neil Bruce's and Kootstra's datasets. Since scale selection
mechanisms strongly influence the formation of saliency maps, two separate
simulations are carried out to investigate that effect as well. Figures 7 and
8 summarize the simulation results of the proposed methods with WSS and DIS,
respectively, on Neil Bruce's image dataset.

According to figure 7 and table 2, the performances of the four WSS
derivatives follow a decreasing order: DWT, QWT, DWPTBB and QWPTBB; however,
all are better than PSS and comparable to the ITT method. In particular, the
computational time is reduced by approximately 7 times relative to PSS; note
that PSS is implemented in C++ with a MATLAB interface, whereas the WSS
variants are written entirely in MATLAB. For Neil Bruce's database, the
best-basis approaches DWPTBB and QWPTBB produce poorer results in both the
accuracy tests, AUC and NSS, and the efficiency test, TIME.

MTH       AUC       NSS        TIME(s)
ITT       0.6944    0.27714    1.096
PSS       0.5856    -0.39175   7.1092
DWT       0.67823   0.33358    1.2401
QWT       0.66279   0.30002    1.9231
DWPTBB    0.6417    0.26079    2.6187
QWPTBB    0.63529   0.23714    5.2836

Table 2: Quantitative Result




MTH       AUC       NSS        TIME(s)
ITT       0.6944    0.27714    1.096
PSS       0.5856    -0.39175   7.1092
DWT       0.7028    0.3178     1.2689
QWT       0.6922    0.3024     1.9527
DWPTBB    0.6299    0.2546     2.4218
QWPTBB    0.6394    0.2351     5.4835

Table 3: Quantitative Result



The quantitative performances of the four DIS methods are shown in figure 8
and table 3. The results are mixed. The performances of the DWT and QWT
descriptors with the DIS approach increase slightly in terms of AUC compared
with the WSS case. However, the "best-basis" descriptors (DWPTBB, QWPTBB)
perform slightly better when WSS is employed instead of DIS. Meanwhile, there
is almost no difference between the WSS and DIS variants in terms of either
NSS or TIME, regardless of the descriptor.

The simulation results above come from Neil Bruce's image dataset with
eye-tracking locations. Despite its recent popularity for evaluating saliency
maps, the dataset has the limitations analysed in subsection 8.1. Another set
of images should be brought in to enhance the diversity of the test samples.
Kootstra's database, with more image categories and eye-tracking ground truth
for all images, is a suitable candidate. Additional simulation results help
to confirm and generalize the rationale of our proposed information-based
saliency methods. As table 2 and figure 7 do for Bruce's database, table 4
and figure 9 show how well the proposed methods, with the four descriptors
and the WSS scale selection mechanism, perform against other saliency methods
such as ITT and PSS.



Figure 9: ROC Curve - WSS

MTH       AUC       NSS        TIME(s)
ITT       0.7819    0.5144     2.3874
PSS       0.5852    -0.3532    17.0663
DWT       0.7150    0.4849     0.4414
QWT       0.7301    0.5070     2.9313
DWPTBB    0.7242    0.4631     4.9577
QWPTBB    0.7612    0.3922     4.7743

Table 4: Quantitative Result

Among the four descriptors, the best result in terms of AUC comes from the
QWPTBB descriptor and the second best from QWT, while DWT and DWPTBB have
nearly equal AUC values. Compared with ITT and PSS, QWPTBB's AUC is close to
that of ITT and much larger than that of PSS. The result strengthens our
hypothesis about the usefulness of informative clues in saliency map
construction. Moreover, it suggests that sub-band wavelet descriptors are
better than pixel-based descriptors for scale-saliency computation. In terms
of NSS, QWT slightly outperforms the other descriptors; its value approaches
the NSS of ITT and clearly surpasses that of PSS.

Figure 9 and table 4 show the numeric evaluation of the proposed methods with
the WSS scale selection mechanism on Kootstra's database. Besides WSS scale
selection, we have another method called DIS, so we should also examine how
DIS performs on the same database with the suggested sub-band descriptors.
Similar quantitative assessments are therefore carried out for wavelet scale
saliency with DIS scale selection, and the simulation results are shown in
figure 10 and table 5.



Figure 10: ROC Curve - DIS

MTH         AUC       NSS        TIME(s)
ITT         0.7819    0.5144     2.3874
PSS         0.5852    -0.3532    17.0663
DWT         0.7058    0.4847     0.4169
DTCWT       0.7173    0.5027     2.8972
DWPTBB      0.7173    0.4556     4.8156
DTCWPTBB    0.7381    0.3468     4.7152

Table 5: Quantitative Result

With these specific simulation parameters, the method still performs quite
well against other methods such as ITT and PSS in terms of both AUC and NSS.
However, both the AUC and NSS of DIS are slightly worse than those of the WSS
methods. Note that ITT analyses three channels (color, intensity and
orientation) simultaneously, while we use only the intensity channel. Only
one channel is chosen because we try to isolate the performance of wavelet
scale saliency from external effects such as the richness of input features.
Regardless of DIS or WSS, the method achieves competitive numerical results,
and this is not due to the comprehensiveness of its input features.

8.3. Qualitative Comparison of Saliency Methods

In this section, we show a few examples of visual saliency maps from the two
databases of Bruce and Kootstra. Although saliency maps are generated for
every image in both databases, only four test images from each are displayed
owing to limited space. The samples are intentionally chosen to show a
variety of contexts and scenes, and to cover both cases in which the objects
of interest are successfully highlighted and cases in which the salient
regions are not emphasized. Along with the proposed methods, saliency maps of
the ITT and PSS methods are included for visual comparison. Four samples from
Neil Bruce's database are displayed directly below.

Four samples, shown in figures 11, 12, 13 and 14, are analysed qualitatively.
Generally, PSS identifies a large portion of each image as salient (white
regions), which explains why its average AUC and NSS in table 2 are the
lowest, while the ITT model gives reasonable saliency maps for three out of
the four samples. The four samples are deliberately chosen to show the
different rankings of the WSS and DIS derivatives and their dependence on the
morphological shape of the mother wavelet. Sometimes their performances are
quite similar, as in figure 11; however, QWT-WSS performs better than DWT-WSS
in many samples, for example in figure 7. DWPTBB-WSS and QWPTBB-WSS also have
their own advantages, especially in the case of textured backgrounds, as in
figure 13. Finally, sometimes none of the proposed methods gives a reasonable
saliency map, as in figure 14. This usually happens when images are flooded
with complex textures.

Figure 11: Saliency Map 1 (panels: Sample, ITT, PSS; DWT-WSS, QWT-WSS,
DWTBB-WSS, QWPTBB-WSS; DWT-DIS, QWT-DIS, DWTBB-DIS, QWPTBB-DIS)

Figure 14: Saliency Map 4

While Neil Bruce's dataset captures daily scenes in urban and suburban areas,
it lacks scenes of natural landscapes. Its images therefore do not represent
the full meaning of the "natural images" category. In order to confirm the
effectiveness of our proposed methods visually, we include four "natural"
samples with the corresponding saliency maps from Kootstra's database in
figures 15, 16, 17 and 18.

Figure 15: Saliency Map 1



For each sample image (in color), we display the saliency maps produced by
ITT, PSS, and the eight derivatives of the proposed wavelet scale saliency
method. Figures 15 and 18 show flowers photographed at close distance, and
therefore contain a number of symmetric, small details. Meanwhile, figures 16
and 17 show whole landscapes of mountains and plateaus. Those scenes are
usually asymmetric and much richer in information than the flower photos. In
general, the ITT method does the best job of selecting the salient features.
Among the derivatives of the proposed method, the QWPTBB descriptors show the
most competitive and comprehensive visual results, followed by the QWT,
DWPTBB and DWT based derivatives in descending order of performance. As for
the WSS and DIS scale-selection mechanisms, there is a slight but noticeable
difference between their saliency maps: WSS saliency maps are in the second
rows and DIS saliency maps in the third rows of figures 15, 16, 17 and 18. In
these examples, DIS maps tend to highlight more features than WSS maps; in
other words, WSS maps might have better discriminant power than DIS maps.
This would explain why the AUC and NSS results in table 4 are slightly better
than those in table 5. There are small changes in the quantitative and visual
results when different parameters are used. However, the proposed methods
perform well against other saliency methods such as ITT and PSS.

Figure 16: Saliency Map 2



9. Conclusion

In this paper, we propose an extension of scale saliency from pixel
descriptors to sub-band energy density descriptors generated by four wavelet
transforms (DWT, DWPTBB, QWT and QWPTBB), with two different scale selection
mechanisms, WSS and DIS. Compared with pixel-value descriptors (PSS), the
proposed descriptors are much sparser but biased toward the morphological
shape of the mother wavelet. Moreover, the proposed descriptors are more
robust to external factors that influence the generation of saliency maps,
such as shift variance and other affine transformations. Furthermore, wavelet
packet descriptors with best-basis algorithms are also considered, since
several psychological experiments suggest a sparseness factor in the human
vision system. Along with the new descriptors, a coherent information
framework for wavelet scale saliency is proposed, and its strong relation to
the Bayesian Surprise Model [14] is emphasized. Besides the theoretical
development, the experimental results are competitive with the
state-of-the-art ITT model and surpass the original scale saliency model,
PSS, both quantitatively and qualitatively. In future research, the
theoretical analysis will be extended to include prior or top-down
information, perceptual grouping and other visual attention operations.